CCDL 2018
In this notebook, we’ll cover how to do conversion between different gene identifiers. Different tools use different ids. PLIER (Mao, et al. bioRxiv. 2017.), which we’ll use shortly, requires that we use gene symbols and the refine.bio processed data uses Ensembl gene ids.
For this part of the workshop, we’ll be using GSE37382, primary medulloblastoma sample data from Northcott, et al. Nature. 2012..
We’ll need something that maps between different types of identifiers to do the conversion. Here, we’ll use the org.Hs.eg.db package from Bioconductor.
The Hs corresponds to H. sapiens.
For more information about using Bioconductor annotation packages, see this vignette about AnnotationDbi.
library(org.Hs.eg.db)
# magrittr pipe
`%>%` <- dplyr::`%>%`
# the directory where the refine.bio processed data lives, it's also where we
# will write the output after the conversion
data_dir <- file.path("data", "GSE37382")
# refine.bio processed expression file
exprs_file <- file.path(data_dir, "GSE37382_SCAN.tsv")
# the file used for mapping between gene ids that is output
mapping_file <- file.path(data_dir,
"genesymbol_ensembl_mappings_used_for_GSE37382.tsv")
# the prepped expression data output file
output_file <- file.path(data_dir, "GSE37382_SCAN_symbol_mean_agg.tsv")
We’ll read in the data with a data.table function because it is much faster than base R.
exprs_df <- data.table::fread(exprs_file, data.table = FALSE)
Let’s take a look at the expression data.frame.
exprs_df[1:5, 1:5]
We can see that the first column, called Gene, contains the Ensembl gene ids. Let’s take a look at the dimensions of exprs_df.
dim(exprs_df)
[1] 21661 286
This means there are 21661 rows, which correspond to genes, and 285 samples because the first column contains gene identifers.
PLIER uses gene symbols, but org.Hs.eg.db can be used for a number of different conversions. Let’s take a look at what it contains.
keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MAP" "OMIM"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
[26] "UNIPROT"
If we had data that used Entrez gene ids, we could use ENTREZID to prep the data for use with PLIER.
Now for the conversion bit! We can use mapIds to do the conversion.
Sometimes the gene id we’re converting from will map to many ids of the type we are converting to or vice versa. mapIds will pick the first one it comes across when it’s using the default arguments or options (setting multiVals = "first"). We don’t necessarily want that behavior, so we can make a list.
# don't replace the identifiers in exprs -- we're concerned about 1:many mappings
mapped_list <- mapIds(org.Hs.eg.db, keys = exprs_df$Gene, column = "SYMBOL",
keytype = "ENSEMBL", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
We can see that this 1:many mapping is the case. mapped_list is a list where the names are Ensembl gene ids and the elements are the gene symbols they map to.
head(mapped_list)
$ENSG00000000003
[1] "TSPAN6"
$ENSG00000000005
[1] "TNMD"
$ENSG00000000419
[1] "DPM1"
$ENSG00000000457
[1] "SCYL3"
$ENSG00000000460
[1] "C1orf112"
$ENSG00000000938
[1] "FGR"
Where are the 1:many mappings? Let’s see what one of those looks like.
list_index <- which(lapply(mapped_list, length) > 1)[2]
mapped_list[list_index]
$ENSG00000011021
[1] "CLCN6" "NPPA-AS1"
How many of the mappings look like this?
sum(lapply(mapped_list, length) > 1)
[1] 123
If we wanted to, we could manually resolve these by looking at other resources.
Let’s get the data into a data.frame for joining.
mapped_df <- reshape2::melt(mapped_list)
head(mapped_df)
Let’s rename those columns.
colnames(mapped_df) <- c("SYMBOL", "ENSEMBL")
head(mapped_df)
What happens to those 1:many mapping cases?
mapped_df %>%
dplyr::filter(ENSEMBL == "ENSG00000011021")
Let’s write this to file for posterity. If we find one gene we’re really interested in, we’ll want to go back to this mapping file and make sure it’s not ambiguous by using other resources.
readr::write_tsv(mapped_df, path = mapping_file)
Now we’ll want to join the mappings to our expression data.
# let's use the mappings to reannotate our data
annot_exprs_df <- mapped_df %>%
# removing anything in the SYMBOL column that is NA
dplyr::filter(!is.na(SYMBOL)) %>%
# join by Ensembl gene ids, retaining only Ensembl gene ids in both
dplyr::inner_join(y = exprs_df, by = c("ENSEMBL" = "Gene"))
What does this new data.frame look like?
annot_exprs_df[1:5, 1:5]
We’ll want to drop the ENSEMBL column.
annot_exprs_df <- annot_exprs_df %>%
dplyr::select(-ENSEMBL)
annot_exprs_df[1:5, 1:5]
Note that if Ensembl gene id mapped to more than one gene symbol, the expression value in that ENSG row was copied to each of the rows using that gene symbol. This should be okay for our purposes.
annot_exprs_df %>%
dplyr::filter(SYMBOL %in% c("CLCN6", "NPPA-AS1"))
We’ll want to check if any of the gene symbols are duplicated, because this will cause issues downstream.
If there are multiples of a gene symbol, let’s take the average expression value. Again, this should be okay for our particular use case.
# how many of the symbols are duplicated?
sum(duplicated(annot_exprs_df$SYMBOL))
[1] 3
Let’s take a look at an example.
dup_index <- which(duplicated(annot_exprs_df$SYMBOL))[1]
dup_symbol <- annot_exprs_df$SYMBOL[dup_index]
annot_exprs_df %>%
dplyr::filter(SYMBOL == dup_symbol)
# checking for missing values -- we'll want to deal with these when we summarize
any(is.na(annot_exprs_df))
[1] FALSE
# can collapse to mean values for duplicated genes without removing NAs, then
agg_exprs_df <- annot_exprs_df %>%
# group by gene identifier
dplyr::group_by(SYMBOL) %>%
# for each column -- take the mean expression value
dplyr::summarise_all(mean)
Let’s take another look at the CALM3 example.
agg_exprs_df %>%
dplyr::filter(SYMBOL == dup_symbol)
Is this consistent with what we expect?
Note that the values in each set of parentheses below represent the gene expression value of each CALM3 duplicate for that sample.
# GSM917011
mean(c(1.677279, 2.091462))
[1] 1.88437
# GSM917014
mean(c(1.770283, 2.113677))
[1] 1.94198
Let’s write our prepped data to file.
# write to file!
readr::write_tsv(agg_exprs_df, path = output_file)
Record session info for reproducibility & provenence purposes.
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid parallel stats4 stats graphics grDevices utils datasets
[9] methods base
other attached packages:
[1] ggsignif_0.4.0 dplyr_0.7.8
[3] ggplot2_3.1.0 PLIER_0.99.0
[5] qvalue_2.14.1 rsvd_1.0.0
[7] knitr_1.21 glmnet_2.0-16
[9] foreach_1.4.4 Matrix_1.2-14
[11] pheatmap_1.0.10 gplots_3.0.1
[13] RColorBrewer_1.1-2 ConsensusClusterPlus_1.46.0
[15] ComplexHeatmap_1.20.0 bindrcpp_0.2.2
[17] org.Hs.eg.db_3.7.0 AnnotationDbi_1.44.0
[19] IRanges_2.16.0 S4Vectors_0.20.1
[21] Biobase_2.42.0 BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] jsonlite_1.6 bit64_0.9-7 splines_3.5.1 gtools_3.8.1
[5] assertthat_0.2.0 blob_1.1.1 yaml_2.2.0 pillar_1.3.1
[9] RSQLite_2.1.1 lattice_0.20-35 glue_1.3.0 digest_0.6.18
[13] colorspace_1.4-0 htmltools_0.3.6 plyr_1.8.4 pkgconfig_2.0.2
[17] GetoptLong_0.1.7 purrr_0.2.5 scales_1.0.0 gdata_2.18.0
[21] tibble_1.4.2 withr_2.1.2 lazyeval_0.2.1 magrittr_1.5
[25] crayon_1.3.4 evaluate_0.12 memoise_1.1.0 rsconnect_0.8.12
[29] tools_3.5.1 data.table_1.11.8 hms_0.4.2 GlobalOptions_0.1.0
[33] stringr_1.3.1 munsell_0.5.0 cluster_2.0.7-1 compiler_3.5.1
[37] caTools_1.17.1.1 rlang_0.3.1 iterators_1.0.10 rstudioapi_0.8
[41] rjson_0.2.20 circlize_0.4.6 base64enc_0.1-3 bitops_1.0-6
[45] rmarkdown_1.11 gtable_0.2.0 codetools_0.2-15 DBI_1.0.0
[49] reshape2_1.4.3 R6_2.3.0 bit_1.1-14 bindr_0.1.1
[53] KernSmooth_2.23-15 shape_1.4.4 readr_1.3.0 stringi_1.2.4
[57] Rcpp_1.0.0 tidyselect_0.2.5 xfun_0.4